Abstract

We introduce in this paper an optimal first-order method that allows an easy and cheap 
evaluation of the local Lipschitz constant of the objective's gradient. This constant must ideally 
be chosen at every iteration as small as possible, while serving in an indispensable upper bound 
for the value of the objective function. In the previously existing variants of optimal first-order 
methods, this upper bound inequality was constructed from points computed during the current 
iteration. It was thus not possible to select the optimal value for this Lipschitz constant at the 
beginning of the iteration. 

In our variant, the upper bound inequality is constructed from points available before the 
current iteration, offering us the possibility to set the Lipschitz constant to its optimal value 
at once. This procedure, even if efficient in practice, presents a higher worst-case complexity
than standard optimal first-order methods. We propose an alternative strategy that retains
the practical efficiency of this procedure, while having an optimal worst-case complexity. We
show how our generic scheme can be adapted for smoothing techniques, and perform numerical
experiments on large-scale eigenvalue minimization problems. As compared with standard
optimal first-order methods, our schemes allow us to divide computation times by two to
three orders of magnitude for the largest problems we considered.

Keywords: Convex Optimization, First-Order Methods, Eigenvalue Optimization. 

1 Introduction 

With a few notable exceptions [GG05], first-order methods constitute the main family of algorithms
able to deal with very large-scale convex optimization problems [Peñ08, Nes10, RT11, Nes12].
Among them, optimal first-order methods play a distinguished role: they are practically as cheap
as a first-order method can be, with a complexity per iteration growing as a moderate polynomial
of the problem's size, while the worst-case number of iterations they require is provably optimal for
smooth instances [NY83, Nes83]. Their scope of applicability is restricted to optimization problems
with a differentiable convex objective function $f$, whose gradient is globally Lipschitz continuous.
Nevertheless, Nesterov introduced a systematic procedure, applicable to many nonsmooth convex
functions, for building a smooth approximation to which one can apply an optimal first-order
minimization algorithm [Nes05]. His construction can easily be tuned to
realize the optimal compromise between the smoothness of the substitute objective function and
how accurately it approximates the original objective. Smoothing techniques dramatically extended
the scope of optimal first-order methods, and many variants of the original scheme developed in
[Nes83] have been studied since then (see e.g. [Tse08, BT09, d'A08, LLM]).

Critically, optimal first-order methods need an estimate of the corresponding Lipschitz constant
with respect to an appropriate norm. Originally, this bound is used to build an approximation
of the epigraph of the objective function. The larger the bound, the worse this approximation is,
and the more steps the method is likely to take. Some strategies have been proposed to update
this bound at every step [Nes07a, BCG11]. These strategies are based on the fact that the Lipschitz
constant is used at a particular iteration to satisfy a single inequality rather than as a global
property. If this inequality is verified, we can reduce the constant, redo the iteration with the
new value, and recheck the inequality, until it is no longer satisfied. If the inequality is not verified,
we simply multiply the constant by an appropriate value and re-perform the iteration as long as
the inequality does not hold. This strategy yielded a significant increase in practical efficiency.
However, the cost of a single iteration has to be multiplied by a number ranging between two and
possibly a few dozen.

We show in this paper how a slight modification of these methods allows us to choose inexpensively
the smallest possible approximation that guarantees the global convergence of the method.
In particular, we avoid redoing several times the work needed for one iteration. The practical effect
of such a procedure is appreciable, and is documented at the end of this paper. On the theoretical
side, we show that our re-evaluation of the Lipschitz constant, if applied systematically, gives an
algorithm which requires at worst $O(LD/\epsilon)$ iterations, where $L$ is the global Lipschitz constant of the
objective's gradient, $D$ measures the diameter of the feasible set, and $\epsilon > 0$ is the desired absolute
accuracy on the objective's value. In comparison with the vanilla optimal first-order method, which
has a complexity of $O(\sqrt{LD/\epsilon})$ iterations, this algorithm is clearly worse. We propose a mixed
strategy that presents simultaneously the practical efficiency of our systematic method for very
large-scale problems and, up to a constant that we can take as close to 1 as desired, the theoretical
efficiency of optimal methods.

When applied to smoothing techniques, our mixed strategy suggests a different choice of the
smoothness parameter than the standard one. This fact should not be too surprising: as our method
is precisely designed to fit appropriate local estimates of the gradient's Lipschitz constant, it allows
us to be slightly sloppier in our request for global smoothness.

In order to validate our general scheme, we consider a well-known application of smoothing
techniques to the problem of minimizing the largest eigenvalue of a convex combination of given
symmetric matrices [Nes07b]. This problem has many applications, and a large variety of methods
have been devised to solve it [HR00, AK07], some of which are adaptations of optimal first-order
methods [NJLS09, BBN11, dK12]. To the best of our knowledge, these methods improve the
complexity of each step of optimal first-order methods, but do not try to decrease the number
of these steps. With respect to original smoothing techniques, our method allows us to divide by
hundreds the practical number of iterations for large-scale instances, that is, when we have 100
matrices of size larger than 200. Our methods even allowed us to deal with a problem involving
10% sparse matrices of dimension 12,800 within 9 hours, while standard optimal first-order methods
would have taken more than one year if they were to perform all the iterations predicted by the
worst-case analysis. It appears in practice that the standard optimal method needs about two thirds of
these iterations: about eight months would be needed to solve that problem.

The paper is organized as follows. We outline our method in Section 2. First, we analyze its
complexity for smooth convex problems and particularize our result to the two variants mentioned
above. Then, we describe how the algorithm and its variants can be particularized to smoothed
problems. In Section 3, we apply these methods to the eigenvalue minimization problem and present
some numerical experiments. We have relegated the rather technical proof of the main theorem of
the paper to the appendix.

2 An accelerated optimal first-order method 

In this section, we introduce an accelerated version of Nesterov's optimal first-order method that is
presented in [Nes05] and discuss its application in smoothing techniques.

2.1 General algorithm 

We start by considering the following optimization problem: 

$$f^* = \min_{x \in Q} f(x), \tag{1}$$

where $Q$ is a closed and convex subset of $\mathbb{R}^n$ and $f : \mathbb{R}^n \to \mathbb{R}$ is a function, which is supposed to
attain its minimum on the set $Q$. In addition, we assume that $f$ is convex and differentiable with
a Lipschitz continuous gradient on $Q$.

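We consider $\mathbb{R}^n$ with the standard Euclidean scalar product, which is denoted by $\langle \cdot, \cdot \rangle$. The
space $\mathbb{R}^n$ is equipped with a norm $\|\cdot\|_{\mathbb{R}^n}$, which may differ from the norm that is induced by the
scalar product. We write $\|\cdot\|_{\mathbb{R}^n,*}$ for the dual norm to $\|\cdot\|_{\mathbb{R}^n}$:

$$\|w\|_{\mathbb{R}^n,*} := \max\{\langle w, x \rangle : \|x\|_{\mathbb{R}^n} = 1\}, \quad w \in \mathbb{R}^n.$$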

As $f$ has a Lipschitz continuous gradient on $Q$, there exists a constant $L = L(Q) > 0$ which
satisfies the inequality:

$$\|\nabla f(x) - \nabla f(y)\|_{\mathbb{R}^n,*} \le L \|x - y\|_{\mathbb{R}^n} \quad \forall\, x, y \in Q. \tag{2}$$

Nesterov developed a first-order method (see Equations (5.6) in [Nes05]) that allows us to compute
approximate solutions to Problem (1). This optimal first-order method has a convergence rate
of $O(L/T^2)$, which outperforms the rate of convergence of common subgradient methods by two
orders of magnitude. We quickly recall that common subgradient methods converge with the order
$O(1/T^{0.5})$; see for instance [NY83].

At every step of Nesterov's optimal first-order method, the Lipschitz constant $L$ is used to
update the iterates; see [Nes05] for the details. However, the constant $L$ is a global parameter of
the function $f$, as $L$ needs to satisfy Condition (2) on the whole set $Q$. In this subsection, we
introduce a refined version of Nesterov's optimal first-order method, where we replace the global
parameter $L$ by local estimates.

This algorithm requires the following basic notions. We say that $d_Q : Q \to \mathbb{R}_{\ge 0}$ is a
distance-generating function for the set $Q$ if it complies with the following requirements:

1. $d_Q$ is continuous on $Q$;

2. $d_Q$ is strongly convex with modulus 1 on $Q$:

$$d_Q(\lambda x + [1 - \lambda] y) + \frac{\lambda [1 - \lambda]}{2} \|x - y\|_{\mathbb{R}^n}^2 \le \lambda d_Q(x) + [1 - \lambda] d_Q(y) \quad \forall\, x, y \in Q,\ \lambda \in [0, 1];$$

3. given the set $Q^o(d_Q) := \{x \in Q : \partial d_Q(x) \neq \emptyset\}$, the subdifferential $\partial d_Q$ gives rise to a
continuous selection $d_Q'$ on the set $Q^o$. If there is no possibility for confusion, we write $Q^o$ instead
of $Q^o(d_Q)$.

Let $d_Q$ be a distance-generating function for the set $Q$ and choose $z \in Q^o$. We write

$$V_z^{d_Q}(x) = d_Q(x) - d_Q(z) - \langle d_Q'(z), x - z \rangle \in \mathbb{R}_{\ge 0}$$

for the Bregman distance of $x \in Q$ with respect to $z \in Q^o$. Nesterov's optimal first-order method
and its accelerated version that we present in this paper utilize a prox-mapping, that is, a mapping
of the form:

$$\mathrm{Prox}_{Q,z}^{d_Q} : \mathbb{R}^n \to Q^o : s \mapsto \arg\min_{x \in Q} \left\{ \langle s, x - z \rangle + V_z^{d_Q}(x) \right\}, \quad z \in Q^o. \tag{3}$$

If there is no possibility for confusion, we abbreviate $V_z^{d_Q}$ and $\mathrm{Prox}_{Q,z}^{d_Q}$ into $V_z$ and $\mathrm{Prox}_{Q,z}$,
respectively. Given $s \in \mathbb{R}^n$ and $z \in Q^o$, the value $\mathrm{Prox}_{Q,z}(s)$ can be rewritten as

$$\mathrm{Prox}_{Q,z}(s) = \arg\min_{x \in Q} \left\{ \langle s - d_Q'(z), x \rangle + d_Q(x) \right\}.$$

It can be easily verified that this optimization problem indeed has a unique minimizer (note that
the objective function $x \mapsto \langle s - d_Q'(z), x \rangle + d_Q(x)$ is continuous and strongly convex; it remains
to apply Lemma 6 from [Nes09]) and that this minimizer belongs to $Q^o$. For the remainder of this
paper, we assume that this minimizer can be computed easily (ideally, we can write it in closed
form). The unique element

$$c(d_Q) := \arg\min_{x \in Q} d_Q(x) \in Q^o$$

is called the $d_Q$-center (note that $c(d_Q) = \mathrm{Prox}_{Q,z}(d_Q'(z))$ for any $z \in Q^o$). Without loss of
generality, we may assume that $d_Q$ vanishes at the point $c(d_Q)$. Then, Lemma 6 in [Nes09] can be
used to justify the following inequality:

$$d_Q(x) \ge \frac{1}{2} \|x - c(d_Q)\|_{\mathbb{R}^n}^2 \quad \forall\, x \in Q. \tag{4}$$
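As an illustration of the closed-form assumption, the following Python sketch evaluates the prox-mapping (3) for the entropy-type distance-generating function $d(x) = \ln(n) + \sum_i x_i \ln(x_i)$ on the probability simplex, the choice that is used later for the eigenvalue application. The function names are hypothetical and the code is not taken from the authors' implementation; it only assumes that $z$ has strictly positive entries, so that $d'(z)_i = 1 + \ln(z_i)$ is well defined.

```python
import math


def entropy_dgf(x):
    """d(x) = ln(n) + sum_i x_i ln(x_i) on the simplex, with 0 ln 0 := 0."""
    return math.log(len(x)) + sum(xi * math.log(xi) for xi in x if xi > 0.0)


def entropy_prox(s, z):
    """Prox_{Q,z}(s) = argmin over the simplex of <s - d'(z), x> + d(x).

    Since d'(z)_i = 1 + ln(z_i), the minimizer is proportional to
    z_i * exp(-s_i); the constant factor cancels in the normalization.
    """
    shift = min(s)  # shifting s by a constant does not change the result
    weights = [zi * math.exp(-(si - shift)) for si, zi in zip(s, z)]
    total = sum(weights)
    return [w / total for w in weights]


def entropy_center(n):
    """The d-center of the simplex is the uniform distribution."""
    return [1.0 / n] * n
```

For instance, `entropy_prox([1 + math.log(zi) for zi in z], z)` returns the uniform distribution for any admissible `z`, in line with the identity $c(d_Q) = \mathrm{Prox}_{Q,z}(d_Q'(z))$ stated above.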

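We now discuss the analytical complexity of the accelerated optimal first-order method displayed
in Algorithm 1.

Algorithm 1 Accelerated optimal first-order method

1: Choose $T \in \mathbb{N}_0$.
2: Choose $(\gamma_t)_{t=0}^{T+1}$ with $\gamma_0 \in (0, 1]$, $\gamma_t > 0$, and $\gamma_t^2 \le \Gamma_t := \sum_{k=0}^{t} \gamma_k$ for any $0 \le t \le T + 1$.
3: Set $L_0 = L$ and $x_0 = c(d_Q)$.
4: Compute $u_0 := \arg\min_{x \in Q} \{\gamma_0 (f(x_0) + \langle \nabla f(x_0), x - x_0 \rangle) + L_0 d_Q(x)\}$.
5: Set $z_0 = u_0$, $\tau_0 = \gamma_1 / \Gamma_1$, and $x_1 = \tau_0 z_0 + (1 - \tau_0) u_0 = z_0$.
6: Define $\hat{x}_1 := \mathrm{Prox}_{Q,z_0}(\gamma_1 \nabla f(x_1) / L_0)$.
7: Set $u_1 = \tau_0 \hat{x}_1 + (1 - \tau_0) u_0$.
8: for $1 \le t \le T$ do
9: Choose $0 < L_t \le L$ such that:
   $$f(u_t) \le f(x_t) + \langle \nabla f(x_t), u_t - x_t \rangle + \frac{L_t}{2} \|u_t - x_t\|_{\mathbb{R}^n}^2. \tag{5}$$
10: Set $z_t = \arg\min_{x \in Q} \left\{ \sum_{k=0}^{t} \gamma_k (f(x_k) + \langle \nabla f(x_k), x - x_k \rangle) + L_t d_Q(x) \right\}$.
11: Set $\tau_t = \gamma_{t+1} / \Gamma_{t+1}$ and $x_{t+1} = \tau_t z_t + (1 - \tau_t) u_t$.
12: Compute $\hat{x}_{t+1} := \mathrm{Prox}_{Q,z_t}(\gamma_{t+1} \nabla f(x_{t+1}) / L_t)$.
13: Set $u_{t+1} = \tau_t \hat{x}_{t+1} + (1 - \tau_t) u_t$.
14: end for

We choose $T \in \mathbb{N}_0 := \mathbb{N} \cup \{0\}$ and assume that the sequences $(x_t)_{t=0}^{T+1}$, $(u_t)_{t=0}^{T+1}$,
$(z_t)_{t=0}^{T}$, $(\hat{x}_t)_{t=1}^{T+1}$, $(\gamma_t)_{t=0}^{T+1}$, $(\Gamma_t)_{t=0}^{T+1}$, $(\tau_t)_{t=0}^{T}$, and $(L_t)_{t=0}^{T}$ are generated by Algorithm 1. Given
$0 \le t \le T$, we say that Inequality $(I_t)$ holds if

$$\Gamma_t f(u_t) + \sum_{k=0}^{t-1} (L_{k+1} - L_k) \left( d_Q(z_{k+1}) - \frac{1}{2} \|z_k - \hat{x}_{k+1}\|_{\mathbb{R}^n}^2 \right) \le \psi_t, \tag{$I_t$}$$

where

$$\psi_t := \min_{x \in Q} \left\{ \sum_{k=0}^{t} \gamma_k (f(x_k) + \langle \nabla f(x_k), x - x_k \rangle) + L_t d_Q(x) \right\}.$$

As the proof of the following result is rather long and technical, we give it in Appendix A.

Theorem 2.1 Inequality $(I_t)$ holds for any $0 \le t \le T$.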
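The steps of Algorithm 1 can be sketched in Python under simplifying assumptions that are ours, not the paper's: the norm is Euclidean, $d_Q(x) = \frac{1}{2}\|x - c\|_2^2$ for a center $c \in Q$, and the user supplies a Euclidean projection onto $Q$. With these choices, Step 10 reduces to a projection of $c - s_t / L_t$ and the prox-mapping to a projection of $z - s$. All names (`accelerated_fom`, `project_box`, `choose_L`) are hypothetical, and the weights are fixed to $\gamma_t = (t+1)/2$.

```python
def project_box(x, lo=0.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^n."""
    return [min(max(xi, lo), hi) for xi in x]


def accelerated_fom(f, grad, project, center, L, T, choose_L=None):
    """Sketch of Algorithm 1 with d_Q(x) = 0.5 * ||x - center||_2^2.

    choose_L(f, x, g, u, L) may return a local estimate L_t (Step 9);
    if it is None, the constant L_t = L is used. Returns u_{T+1}.
    """
    def gamma(t):
        return (t + 1) / 2.0

    def big_gamma(t):
        return (t + 1) * (t + 2) / 4.0

    def argmin_model(s, Lt):
        # argmin_x <s, x> + Lt * d_Q(x) is the projection of center - s / Lt
        return project([c - si / Lt for c, si in zip(center, s)])

    def prox(z, s):
        # Prox_{Q,z}(s) is the projection of z - s for this Euclidean d_Q
        return project([zi - si for zi, si in zip(z, s)])

    def combine(tau, a, b):
        return [tau * ai + (1.0 - tau) * bi for ai, bi in zip(a, b)]

    x = list(center)                                   # Step 3
    Lt = L
    s = [gamma(0) * gi for gi in grad(x)]              # sum of gamma_k * grad f(x_k)
    u = argmin_model(s, Lt)                            # Step 4
    z = list(u)                                        # Step 5
    tau = gamma(1) / big_gamma(1)
    x = combine(tau, z, u)
    g = grad(x)
    x_hat = prox(z, [gamma(1) * gi / Lt for gi in g])  # Step 6
    u = combine(tau, x_hat, u)                         # Step 7
    for t in range(1, T + 1):
        if choose_L is not None:
            Lt = choose_L(f, x, g, u, L)               # Step 9
        s = [si + gamma(t) * gi for si, gi in zip(s, g)]
        z = argmin_model(s, Lt)                        # Step 10
        tau = gamma(t + 1) / big_gamma(t + 1)          # Step 11
        x = combine(tau, z, u)
        g = grad(x)
        x_hat = prox(z, [gamma(t + 1) * gi / Lt for gi in g])  # Step 12
        u = combine(tau, x_hat, u)                     # Step 13
    return u
```

The running sum `s` avoids storing all past gradients: only the linear part of the model in Step 10 matters for the minimizer, so the constant terms $\gamma_k (f(x_k) - \langle \nabla f(x_k), x_k \rangle)$ can be dropped.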

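For the remainder of this subsection, we refer to $x^* \in Q$ as an optimal solution to the optimization
problem $f^* = \min_{x \in Q} f(x)$.

Theorem 2.2 For any $T \in \mathbb{N}_0$, we have:

$$f(u_T) - f^* \le \frac{1}{\Gamma_T} \left( L_T d_Q(x^*) + \sum_{t=0}^{T-1} (L_t - L_{t+1}) \left( d_Q(z_{t+1}) - \frac{1}{2} \|z_t - \hat{x}_{t+1}\|_{\mathbb{R}^n}^2 \right) \right).$$

Proof: Let $0 \le t \le T$. The convexity of the function $f$ and the definition of $\Gamma_t$ imply

$$\psi_t = \min_{x \in Q} \left\{ L_t d_Q(x) + \sum_{k=0}^{t} \gamma_k (f(x_k) + \langle \nabla f(x_k), x - x_k \rangle) \right\}$$
$$\le L_t d_Q(x^*) + \sum_{k=0}^{t} \gamma_k (f(x_k) + \langle \nabla f(x_k), x^* - x_k \rangle)$$
$$\le L_t d_Q(x^*) + \sum_{k=0}^{t} \gamma_k f(x^*)$$
$$= L_t d_Q(x^*) + \Gamma_t f(x^*).$$

It remains to combine this inequality with Theorem 2.1. ■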
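Nesterov [Nes05] suggests choosing the sequence $(\gamma_t)_{t=0}^{T+1}$ as

$$\gamma_t := \frac{t + 1}{2} \quad \forall\, 0 \le t \le T + 1. \tag{6}$$

Lemma 2 of [Nes05] shows that we have the following equations for this choice of the sequence
$(\gamma_t)_{t=0}^{T+1}$:

$$\tau_t = \frac{2}{t + 3} \quad \forall\, 0 \le t \le T$$

and

$$\Gamma_t = \frac{(t + 1)(t + 2)}{4}, \quad \gamma_t^2 \le \Gamma_t, \quad \forall\, 0 \le t \le T + 1.$$

As an immediate consequence of Theorem 2.2 we obtain the following result for our accelerated
optimal first-order method.

Corollary 2.1 Let us choose the sequence $(\gamma_t)_{t=0}^{T+1}$ in Algorithm 1 as described in (6). Then, we
have for any $T \in \mathbb{N}_0$:

$$f(u_T) - f^* \le \frac{4 L_T d_Q(x^*)}{(T + 1)(T + 2)} + \frac{4}{(T + 1)(T + 2)} \sum_{t=0}^{T-1} (L_t - L_{t+1}) \left( d_Q(z_{t+1}) - \frac{1}{2} \|z_t - \hat{x}_{t+1}\|_{\mathbb{R}^n}^2 \right). \tag{7}$$

■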
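The closed forms for $\Gamma_t$ and $\tau_t$ that follow from the choice $\gamma_t = (t+1)/2$ can be checked with exact rational arithmetic. The short Python sketch below is only a sanity check of these identities and of the requirement $\gamma_t^2 \le \Gamma_t$ of Algorithm 1; the function names are ours.

```python
from fractions import Fraction


def gamma(t):
    """Weight gamma_t = (t + 1) / 2."""
    return Fraction(t + 1, 2)


def big_gamma(t):
    """Gamma_t = gamma_0 + ... + gamma_t, computed from its definition."""
    return sum(gamma(k) for k in range(t + 1))


def tau(t):
    """tau_t = gamma_{t+1} / Gamma_{t+1}."""
    return gamma(t + 1) / big_gamma(t + 1)


for t in range(20):
    assert big_gamma(t) == Fraction((t + 1) * (t + 2), 4)
    assert tau(t) == Fraction(2, t + 3)
    assert gamma(t) ** 2 <= big_gamma(t)
```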

There exist different strategies for updating the sequence $(L_t)_{t=0}^{T}$ in Algorithm 1. In the
non-adaptive setting, that is, when $L_t := L$ for any $0 \le t \le T$, we recover the complexity results of
Nesterov's optimal first-order method (see e.g. Subsection 5.3 of [Nes05]), for which Inequality (7)
can be rewritten as $f(u_T) - f^* \le \frac{4 L d_Q(x^*)}{(T + 1)(T + 2)}$.

Alternative 1: (most aggressive adaptive setting) Fix $0 < \kappa \ll 1$ and let $1 \le t \le T$. The
most aggressive choice for the constant $L_t$ corresponds to

$$L_t := \max\{\hat{L}_t, \kappa L\} \in [\kappa L, L], \qquad \hat{L}_t := \frac{2 \left[ f(u_t) - f(x_t) - \langle \nabla f(x_t), u_t - x_t \rangle \right]}{\|u_t - x_t\|_{\mathbb{R}^n}^2}. \tag{8}$$

The computation of the constant $L_t$ requires the quantities $u_t$, $x_t$, and $\nabla f(x_t)$. In sharp contrast with
the methods proposed so far [Nes07a, BCG11], all these quantities are known from the previous step
$t - 1$, implying that the constant $L_t$ can be determined immediately.
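The update rule (8) is cheap to evaluate because it needs a single extra function value. The following Python sketch computes it for the Euclidean norm; the function name, the default value of `kappa`, and the treatment of the degenerate case $u_t = x_t$ are assumptions made for this illustration and are not specified in the text.

```python
def local_lipschitz_estimate(f, grad, x, u, L, kappa=1e-3):
    """Return max(L_hat, kappa * L), where L_hat is the smallest constant
    for which inequality (5) holds at the pair (x, u); Euclidean norm."""
    g = grad(x)
    diff = [ui - xi for ui, xi in zip(u, x)]
    dist2 = sum(d * d for d in diff)
    if dist2 == 0.0:
        # u == x: inequality (5) holds for every positive constant
        return kappa * L
    gap = f(u) - f(x) - sum(gi * di for gi, di in zip(g, diff))
    return max(2.0 * gap / dist2, kappa * L)
```

For the separable quadratic $f(x) = \frac{1}{2}(x_1^2 + 4 x_2^2 + 10 x_3^2)$, whose global constant is $L = 10$, the estimate along the first coordinate direction is 1 and along the third it is 10, which shows how much the local value can differ from the global one.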

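Independently of the choice of the $L_t$'s, we can always derive the following trivial convergence
result for Algorithm 1 from Inequality (7):

$$f(u_T) - f^* \le \frac{4 L \sup_{x \in Q} d_Q(x)}{(T + 1)(T + 2)} + \frac{20 L T \sup_{x \in Q} d_Q(x)}{(T + 1)(T + 2)} \le \frac{20 L \sup_{x \in Q} d_Q(x)}{T + 2},$$

as $L_T d_Q(x^*) \le L \sup_{x \in Q} d_Q(x)$ and

$$\sum_{t=0}^{T-1} (L_t - L_{t+1}) \left( d_Q(z_{t+1}) - \frac{1}{2} \|z_t - \hat{x}_{t+1}\|_{\mathbb{R}^n}^2 \right)
\le \sum_{t=0}^{T-1} |L_t - L_{t+1}| \left( d_Q(z_{t+1}) + \frac{1}{2} \|z_t - \hat{x}_{t+1}\|_{\mathbb{R}^n}^2 \right)$$
$$\le L \sum_{t=0}^{T-1} \left( \sup_{x \in Q} d_Q(x) + 4 \sup_{x \in Q} d_Q(x) \right)
= 5 L T \sup_{x \in Q} d_Q(x).$$

Note that the last inequality holds due to (4).

Thus, Algorithm 1 equipped with the most aggressive update strategy, which is described in
(8), needs at most

$$T = \left\lceil \frac{20 L \sup_{x \in Q} d_Q(x)}{\epsilon} - 2 \right\rceil$$

iterations to find a feasible $\epsilon$-solution, provided that $\sup_{x \in Q} d_Q(x)$ is finite.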

Alternative 2: (hybrid setting) Finally, we can combine the two settings that are presented
above. We choose a number $\alpha \ge 0$ and denote by $1 \le \bar{t} \le T$ the current iteration. As long as

$$\sum_{k=0}^{t-1} (L_k - L_{k+1}) \left( d_Q(z_{k+1}) - \frac{1}{2} \|z_k - \hat{x}_{k+1}\|_{\mathbb{R}^n}^2 \right) \le \alpha L d_Q(x^*) \quad \forall\, 1 \le t \le \bar{t}, \tag{9}$$

we use the update strategy that is described in (8). When Condition (9) is not satisfied for the first
time, we set $L_t := L$ for any $t \ge \bar{t}$ and recompute the point $z_{\bar{t}}$.

With the setting just specified, Inequality (7) results in the bound

$$f(u_T) - f^* \le \frac{4 (1 + \alpha) L d_Q(x^*)}{(T + 1)(T + 2)}.$$

That is, we need to perform at most

$$T = \left\lceil 2 \sqrt{\frac{(1 + \alpha) L d_Q(x^*)}{\epsilon}} - 1 \right\rceil$$

iterations of Algorithm 1 to find a point $x \in Q$ with $f(x) - f^* \le \epsilon$, where $\epsilon > 0$. This complexity
result deviates by a factor of $(1 + \alpha)^{1/2}$ from the efficiency estimate of the non-adaptive method.
With $\alpha = 5 (T + 1) \sup_{x \in Q} d(x) - 1$, the setting coincides with Alternative 1.
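To compare the three worst-case estimates of this subsection numerically, the following Python sketch evaluates the iteration bounds of the non-adaptive method (the hybrid bound with $\alpha = 0$), the hybrid setting, and the most aggressive setting. The function names and the rounding up to the next integer are our assumptions; `d_star` stands for $d_Q(x^*)$ and `d_sup` for $\sup_{x \in Q} d_Q(x)$.

```python
import math


def iterations_hybrid(L, d_star, eps, alpha):
    """Bound from 4 (1 + alpha) L d_Q(x*) / (T + 1)^2 <= eps."""
    return math.ceil(2.0 * math.sqrt((1.0 + alpha) * L * d_star / eps) - 1.0)


def iterations_nonadaptive(L, d_star, eps):
    """Non-adaptive method: the hybrid bound with alpha = 0."""
    return iterations_hybrid(L, d_star, eps, 0.0)


def iterations_aggressive(L, d_sup, eps):
    """Bound from 20 L sup d_Q / (T + 2) <= eps."""
    return math.ceil(20.0 * L * d_sup / eps - 2.0)
```

With `L = 64`, `d_star = d_sup = 1`, and `eps = 0.25`, these functions return 31, 63 (for `alpha = 3`), and 5118 iterations, respectively: the hybrid bound grows only with $\sqrt{1 + \alpha}$, whereas the aggressive bound scales with $1/\epsilon$ instead of $1/\sqrt{\epsilon}$.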



2.2 The accelerated optimal first-order method in smoothing techniques 

Smoothing techniques [Nes05] constitute a two-stage procedure that can be applied to non-smooth
optimization problems with a very particular structure. In a first step, a smooth approximation
of the non-smooth objective function is formed, so that Nesterov's optimal first-order method can
be applied afterwards. In this section, we study the effects of replacing Nesterov's original optimal
first-order method by its accelerated version in smoothing techniques.

We assume that the sets $Q_1 \subset \mathbb{R}^n$ and $Q_2 \subset \mathbb{R}^m$ are both compact and convex. In addition, we
endow the spaces $\mathbb{R}^n$ and $\mathbb{R}^m$ with two (maybe different) norms. We denote by $\|\cdot\|_{\mathbb{R}^n}$ and $\|\cdot\|_{\mathbb{R}^m}$
the norm of the spaces $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively. Nesterov considers convex optimization problems
of the form:

$$\min_{x \in Q_1} \max_{y \in Q_2} \phi(x, y), \qquad \phi(x, y) := f_1(x) + \langle A(x), y \rangle - f_2(y), \tag{10}$$

where $f_1 : \mathbb{R}^n \to \mathbb{R}$, $f_2 : \mathbb{R}^m \to \mathbb{R}$ are smooth and convex, and $A : \mathbb{R}^n \to \mathbb{R}^m$ is a linear operator.
With a slight abuse of notation, we write $\langle \cdot, \cdot \rangle$ for the Euclidean scalar product in both spaces $\mathbb{R}^n$
and $\mathbb{R}^m$.

According to the standard MiniMax Theorem in Convex Analysis (see Corollary 37.3.2 in
[Roc70]), we have, due to the compactness and convexity of the sets $Q_1$ and $Q_2$, the following
pair of primal-dual convex optimization problems:

$$\min_{x \in Q_1} \left\{ \overline{\phi}(x) := \max_{y \in Q_2} \phi(x, y) \right\} = \max_{y \in Q_2} \left\{ \underline{\phi}(y) := \min_{x \in Q_1} \phi(x, y) \right\}.$$

The operator $A$ comes with an adjoint operator $A^* : \mathbb{R}^m \to \mathbb{R}^n$, which is defined by the relation:

$$\langle A(x), y \rangle = \langle x, A^*(y) \rangle \quad \forall\, (x, y) \in \mathbb{R}^n \times \mathbb{R}^m.$$

The analysis of Nesterov's smoothing techniques requires a norm of the operator $A$. This norm is
constructed as follows:

$$\|A\|_{\mathbb{R}^n, \mathbb{R}^m} := \max \left\{ \langle A(x), y \rangle : \|x\|_{\mathbb{R}^n} = 1,\ \|y\|_{\mathbb{R}^m} = 1 \right\}.$$

We are ready to form a smooth approximation of $\overline{\phi}$ to which we can apply Algorithm 1. We
choose a distance-generating function $d_{Q_2} : Q_2 \to \mathbb{R}_{\ge 0}$ for the set $Q_2$ and consider the auxiliary
function

$$\overline{\phi}_\mu : \mathbb{R}^n \to \mathbb{R} : x \mapsto \max_{y \in Q_2} \left\{ f_1(x) + \langle A(x), y \rangle - f_2(y) - \mu d_{Q_2}(y) \right\},$$

where $\mu > 0$ is a positive smoothness parameter. This function defines a uniform approximation of
$\overline{\phi}$, as

$$\overline{\phi}_\mu(x) \le \overline{\phi}(x) \le \overline{\phi}_\mu(x) + \mu \max_{z \in Q_2} d_{Q_2}(z) \quad \forall\, x \in Q_1; \tag{11}$$

see Inequality (2.7) in [Nes05]. The function $y \mapsto \langle A(x), y \rangle - f_2(y) - \mu d_{Q_2}(y)$ is strongly concave for
any $x \in Q_1$, as the distance-generating function $d_{Q_2}$ is strongly convex by its definition. Hence, the
function $y \mapsto \langle A(x), y \rangle - f_2(y) - \mu d_{Q_2}(y)$ has a unique maximizer on $Q_2$. We denote this maximizer
by $y_\mu(x)$.

Nesterov showed that $\overline{\phi}_\mu$ is differentiable with a Lipschitz continuous gradient. We write $M \ge 0$
for the Lipschitz constant of the gradient of $f_1$.
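A concrete instance may help to see how the smoothing works. In the Python sketch below we take, as our own simplifying assumptions, $f_1 = f_2 = 0$, $A$ the identity, $Q_2$ the probability simplex, and $d_{Q_2}(y) = \ln(n) + \sum_i y_i \ln(y_i)$. The non-smooth function is then $\max_i a_i$, its smooth approximation is a shifted log-sum-exp, and the unique maximizer $y_\mu$ is the softmax vector. The function names are hypothetical.

```python
import math


def smoothed_max(a, mu):
    """max over the simplex of <a, y> - mu * d(y), which equals
    mu * ln(sum_i exp(a_i / mu)) - mu * ln(n)."""
    top = max(a)  # subtracting the maximum avoids overflow in exp
    log_sum = top + mu * math.log(sum(math.exp((ai - top) / mu) for ai in a))
    return log_sum - mu * math.log(len(a))


def smoothed_max_maximizer(a, mu):
    """Unique maximizer y_mu: the softmax of a / mu. Because A is the
    identity here, it is also the gradient of smoothed_max in a."""
    top = max(a)
    weights = [math.exp((ai - top) / mu) for ai in a]
    total = sum(weights)
    return [w / total for w in weights]
```

For any vector `a` of length `n`, the value `smoothed_max(a, mu)` lies between `max(a) - mu * math.log(n)` and `max(a)`, which is exactly the uniform approximation property (11) with $\max_z d_{Q_2}(z) = \ln(n)$.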

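Theorem 2.3 (Theorem 1 in [Nes05]) The function $\overline{\phi}_\mu$ is well-defined, continuously
differentiable, and convex on $\mathbb{R}^n$. The gradient of $\overline{\phi}_\mu$ takes the form

$$\nabla \overline{\phi}_\mu(x) = \nabla f_1(x) + A^*(y_\mu(x)),$$

and is Lipschitz continuous with the constant $L_\mu := M + \|A\|_{\mathbb{R}^n, \mathbb{R}^m}^2 / \mu$.

Algorithm 2 Algorithm 1 (with $\gamma_t = (t + 1)/2$) applied to Problem (12)

1: Choose $T \in \mathbb{N}_0$.
2: Choose a smoothness parameter $\mu > 0$ and a distance-generating function $d_{Q_1} : Q_1 \to \mathbb{R}$ for
   the set $Q_1$.
3: Set $L_0 = L_\mu = M + \|A\|_{\mathbb{R}^n, \mathbb{R}^m}^2 / \mu$ and $x_0 = c(d_{Q_1})$.
4: Set $u_0 = \arg\min_{x \in Q_1} \left\{ \frac{1}{2} (\overline{\phi}_\mu(x_0) + \langle \nabla \overline{\phi}_\mu(x_0), x - x_0 \rangle) + L_0 d_{Q_1}(x) \right\}$.
5: Set $z_0 = u_0$, $\tau_0 = \frac{2}{3}$, and $x_1 = \tau_0 z_0 + (1 - \tau_0) u_0 = z_0$.
6: Define $\hat{x}_1 := \mathrm{Prox}_{Q_1, z_0}(\nabla \overline{\phi}_\mu(x_1) / L_0)$.
7: Set $u_1 = \tau_0 \hat{x}_1 + (1 - \tau_0) u_0$.
8: for $1 \le t \le T$ do
9: Choose $0 < L_t \le L_\mu$ such that:
   $$\overline{\phi}_\mu(u_t) \le \overline{\phi}_\mu(x_t) + \langle \nabla \overline{\phi}_\mu(x_t), u_t - x_t \rangle + \frac{L_t}{2} \|u_t - x_t\|_{\mathbb{R}^n}^2.$$
10: Set
   $$z_t = \arg\min_{x \in Q_1} \left\{ \sum_{k=0}^{t} \frac{k + 1}{2} (\overline{\phi}_\mu(x_k) + \langle \nabla \overline{\phi}_\mu(x_k), x - x_k \rangle) + L_t d_{Q_1}(x) \right\}.$$
11: Set $\tau_t = \frac{2}{t + 3}$ and $x_{t+1} = \tau_t z_t + (1 - \tau_t) u_t$.
12: Compute $\hat{x}_{t+1} := \mathrm{Prox}_{Q_1, z_t}\left( \frac{t + 2}{2} \nabla \overline{\phi}_\mu(x_{t+1}) / L_t \right)$.
13: Set $u_{t+1} = \tau_t \hat{x}_{t+1} + (1 - \tau_t) u_t$.
14: end for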



As an immediate consequence, we can apply Algorithm 1 to the problem:

$$\min_{x \in Q_1} \overline{\phi}_\mu(x). \tag{12}$$

Algorithm 2 corresponds to Algorithm 1 when we apply this method with step-sizes as described
in (6) to Problem (12). A slight adaptation of the proof of Theorem 3 in [Nes05] yields the
following result, for which we need the definitions:

$$D_1 := \max_{x \in Q_1} d_{Q_1}(x) \quad \text{and} \quad D_2 := \max_{y \in Q_2} d_{Q_2}(y).$$

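Theorem 2.4 Fix $T \in \mathbb{N}_0$ and assume that the sequences $(x_t)_{t=0}^{T+1}$, $(u_t)_{t=0}^{T+1}$, $(z_t)_{t=0}^{T}$, $(\hat{x}_t)_{t=1}^{T+1}$, and
$(L_t)_{t=0}^{T}$ are generated by Algorithm 2 with the smoothness parameter $\mu > 0$. For

$$\bar{x} := u_T \in Q_1 \quad \text{and} \quad \bar{y} := \sum_{t=0}^{T} \frac{2 (t + 1)}{(T + 1)(T + 2)} y_\mu(x_t) \in Q_2,$$

we have:

$$\overline{\phi}(\bar{x}) - \underline{\phi}(\bar{y}) \le \frac{4 (L_T D_1 - \chi_T)}{(T + 1)^2} + \mu D_2, \tag{13}$$

where

$$\chi_T := \sum_{t=0}^{T-1} (L_{t+1} - L_t) \left( d_{Q_1}(z_{t+1}) - \frac{1}{2} \|z_t - \hat{x}_{t+1}\|_{\mathbb{R}^n}^2 \right).$$

For the remainder of this section, we use the notations of Algorithm 2 and Theorem 2.4.

Proof: In accordance with Theorem 2.1 and the step-size choice (6), we have the inequality:

$$\overline{\phi}_\mu(\bar{x}) = \overline{\phi}_\mu(u_T) \le \frac{4 (L_T D_1 - \chi_T)}{(T + 1)(T + 2)} + \min_{x \in Q_1} \frac{2 \beta_T(x)}{(T + 1)(T + 2)}, \tag{14}$$

where

$$\beta_T(x) := \sum_{t=0}^{T} (t + 1) \left( \overline{\phi}_\mu(x_t) + \langle \nabla \overline{\phi}_\mu(x_t), x - x_t \rangle \right) \quad \forall\, x \in Q_1.$$

Let $x \in Q_1$. Using Theorem 2.3 and the convexity of $f_1$ and $f_2$, we can write:

$$\beta_T(x) = \sum_{t=0}^{T} (t + 1) \left( f_1(x_t) + \langle \nabla f_1(x_t), x - x_t \rangle + \langle A(x), y_\mu(x_t) \rangle - f_2(y_\mu(x_t)) - \mu d_{Q_2}(y_\mu(x_t)) \right)$$
$$\le \sum_{t=0}^{T} (t + 1) \left( f_1(x) + \langle A(x), y_\mu(x_t) \rangle - f_2(y_\mu(x_t)) \right)$$
$$\le \frac{(T + 1)(T + 2)}{2} \left( f_1(x) + \langle A(x), \bar{y} \rangle - f_2(\bar{y}) \right).$$

The above inequality implies:

$$\min_{x \in Q_1} \beta_T(x) \le \frac{(T + 1)(T + 2)}{2} \underline{\phi}(\bar{y}). \tag{15}$$

Recall that we have $L_t \le L_\mu = \|A\|_{\mathbb{R}^n, \mathbb{R}^m}^2 / \mu + M$ by construction. We use Inequalities (14), (15),
and (11) to justify the following inequalities:

$$\frac{4 (L_T D_1 - \chi_T)}{(T + 1)^2} \ge \frac{4 (L_T D_1 - \chi_T)}{(T + 1)(T + 2)} \ge \overline{\phi}_\mu(\bar{x}) - \underline{\phi}(\bar{y}) \ge \overline{\phi}(\bar{x}) - \underline{\phi}(\bar{y}) - \mu D_2.$$

■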

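We conclude this section by discussing different strategies for choosing the sequence $(L_t)_{t=0}^{T}$ and
the smoothness parameter $\mu$.

Alternative 1: (most aggressive adaptive setting) We can always give the following upper
bound for the quantity $(-\chi_T)$ in Theorem 2.4:

$$-\chi_T \le 5 L_\mu D_1 T = 5 D_1 T \left( \frac{\|A\|_{\mathbb{R}^n, \mathbb{R}^m}^2}{\mu} + M \right),$$

which allows us to reformulate (13) as

$$\overline{\phi}(\bar{x}) - \underline{\phi}(\bar{y}) \le \frac{20 D_1 \left( \|A\|_{\mathbb{R}^n, \mathbb{R}^m}^2 / \mu + M \right)}{T + 1} + \mu D_2.$$

Minimizing the right-hand side of the above inequality with respect to $\mu$, that is, setting $\mu$ to

$$2 \|A\|_{\mathbb{R}^n, \mathbb{R}^m} \sqrt{\frac{5 D_1}{(T + 1) D_2}},$$

we obtain:

$$\overline{\phi}(\bar{x}) - \underline{\phi}(\bar{y}) \le 4 \|A\|_{\mathbb{R}^n, \mathbb{R}^m} \sqrt{\frac{5 D_1 D_2}{T + 1}} + \frac{20 D_1 M}{T + 1}.$$

As this bound is independent of the choice of the $L_t$'s, it is valid also for the most aggressive setting,
that is, for

$$L_t := \max\{\hat{L}_t, \kappa L_\mu\} \in [\kappa L_\mu, L_\mu], \qquad \hat{L}_t := \frac{2 \left[ \overline{\phi}_\mu(u_t) - \overline{\phi}_\mu(x_t) - \langle \nabla \overline{\phi}_\mu(x_t), u_t - x_t \rangle \right]}{\|u_t - x_t\|_{\mathbb{R}^n}^2}, \tag{16}$$

where $1 \le t \le T$ and $0 < \kappa < 1$ is fixed.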

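Alternative 2: (hybrid setting) Let $\alpha \ge 0$. We follow the setting described in (16) for all
$1 \le \bar{t} \le T$ as long as

$$-\chi_t \le \alpha L_\mu D_1$$

is satisfied for any $1 \le t \le \bar{t}$. When this condition fails for the first time, say for $t = t'$, we set $L_t$
to $L_\mu$ for any $t \ge t'$ and recompute the point $z_{t'}$. In this hybrid setting, Inequality (13) yields

$$\overline{\phi}(\bar{x}) - \underline{\phi}(\bar{y}) \le \frac{4 (1 + \alpha) D_1 \left( \|A\|_{\mathbb{R}^n, \mathbb{R}^m}^2 / \mu + M \right)}{(T + 1)^2} + \mu D_2.$$

We choose $\mu$ such that the right-hand side of the above inequality is minimized, that is, we fix $\mu$ to

$$\frac{2 \|A\|_{\mathbb{R}^n, \mathbb{R}^m}}{T + 1} \sqrt{\frac{(1 + \alpha) D_1}{D_2}}$$

and end up with the following bound:

$$\overline{\phi}(\bar{x}) - \underline{\phi}(\bar{y}) \le \frac{4 \|A\|_{\mathbb{R}^n, \mathbb{R}^m} \sqrt{(1 + \alpha) D_1 D_2}}{T + 1} + \frac{4 (1 + \alpha) D_1 M}{(T + 1)^2}.$$

Note that Alternative 2 coincides with Alternative 1 if $\alpha = 5 (T + 1) - 1$.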



3 An application in large-scale eigenvalue optimization 

In this section, we study the practical behavior of accelerated smoothing techniques. We apply
them to the problem of finding a convex combination of given symmetric matrices such that the
maximal eigenvalue of the resulting matrix is minimal.

3.1 Problem description

Let

$$\Delta_m := \left\{ x \in \mathbb{R}^m : x \ge 0,\ \sum_{j=1}^{m} x_j = 1 \right\}$$

be the $(m - 1)$-dimensional probability simplex. Denoting by $S_n$ the space of symmetric real $(n \times n)$-
matrices, we write $Y \succeq 0$ if $Y \in S_n$ is positive semidefinite and $\mathrm{Tr}(Y) := \sum_{i=1}^{n} Y_{ii}$ for the trace of
$Y$. We refer to

$$\Delta_n^M := \{ Y \succeq 0 : \mathrm{Tr}(Y) = 1 \} \subset S_n$$

as the simplex in matrix form. Finally, we denote by

$$\lambda_n(Y) \ge \ldots \ge \lambda_1(Y)$$

the eigenvalues of the symmetric matrix $Y \succeq 0$ and assume that they are ordered decreasingly.
Throughout this section, we consider the following problem:

$$\min_{x \in \Delta_m} \lambda_n \left( \sum_{j=1}^{m} x_j A_j \right) = \min_{x \in \Delta_m} \left\{ \overline{\phi}(x) := \max_{Y \in \Delta_n^M} \sum_{j=1}^{m} x_j \langle A_j, Y \rangle_F \right\}, \tag{17}$$

where $A_1, \ldots, A_m \in S_n$ and $\langle \cdot, \cdot \rangle_F$ denotes the Frobenius scalar product.

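3.2 Applying accelerated smoothing techniques
3.2.1 Smoothing the objective function

We equip $S_n$ with the induced 1-norm, that is, with $\|Y\|_{(1)} := \sum_{i=1}^{n} |\lambda_i(Y)|$, where $Y \in S_n$. The
dual norm corresponds to the induced $\infty$-norm, that is, to the norm $\|W\|_{(\infty)} := \max_{1 \le i \le n} |\lambda_i(W)|$
with $W \in S_n$. We choose

$$d_{\Delta_n^M}(Y) := \ln(n) + \sum_{i=1}^{n} \lambda_i(Y) \ln(\lambda_i(Y)), \quad Y \in \Delta_n^M,$$

as distance-generating function for the set $\Delta_n^M$, for which we have $d_{\Delta_n^M}(Y) \le \ln(n)$ for any $Y \in \Delta_n^M$;
see for instance [Nes07b] for a proof that $d_{\Delta_n^M}$ is a distance-generating function for $\Delta_n^M$. We obtain
the following smooth objective function as an approximation to $\overline{\phi}$:

$$\overline{\phi}_\mu(x) := \max_{Y \in \Delta_n^M} \left\{ \sum_{j=1}^{m} x_j \langle A_j, Y \rangle_F - \mu d_{\Delta_n^M}(Y) \right\} = \mu \ln \left( \sum_{i=1}^{n} \exp \left( \frac{\lambda_i \left( \sum_{j=1}^{m} x_j A_j \right)}{\mu} \right) \right) - \mu \ln(n),$$

where $x \in \Delta_m$ and $\mu > 0$ denotes the smoothness parameter. The approximation quality depends
on the smoothness parameter:

$$\overline{\phi}_\mu(x) \le \overline{\phi}(x) \le \overline{\phi}_\mu(x) + \mu \ln(n) \quad \forall\, x \in \Delta_m.$$

Finally, the gradient of $\overline{\phi}_\mu$ is given by

$$[\nabla \overline{\phi}_\mu(x)]_j = \langle A_j, Y_\mu(x) \rangle_F \quad \forall\, 1 \le j \le m,$$

where $x \in \Delta_m$ and $Y_\mu(x)$ denotes the unique maximizer of $Y \mapsto \sum_{j=1}^{m} x_j \langle A_j, Y \rangle_F - \mu d_{\Delta_n^M}(Y)$ over
$\Delta_n^M$. Theorem 2.3 implies that the gradient is Lipschitz continuous with a Lipschitz constant of
$L_\mu := \max_{1 \le j \le m} \|A_j\|_{(\infty)}^2 / \mu$.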

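3.2.2 Applying the accelerated optimal first-order method with hybrid setting

Let the space $\mathbb{R}^m$ be equipped with the 1-norm. We use

$$d_{\Delta_m}(x) := \ln(m) + \sum_{j=1}^{m} x_j \ln(x_j), \quad x \in \Delta_m,$$

as distance-generating function for the set $\Delta_m$. Note that $d_{\Delta_m}(x) \le \ln(m)$ for any $x \in \Delta_m$.

We run Algorithm 2 with the hybrid setting that is described in Alternative 2 in Section 2.2.
Let us fix the accuracy $\epsilon > 0$ and the parameter $\alpha \ge 0$ that defines when to switch back to the
non-adaptive setting. The smoothness parameter is set as follows:

$$\mu := \frac{\epsilon}{2 \ln(n)}.$$

Note that the smoothness parameter does not depend on $\alpha$. According to Theorem 2.4, we need to
perform at most

$$T = \left\lceil \frac{4 \max_{1 \le j \le m} \|A_j\|_{(\infty)} \sqrt{(1 + \alpha) \ln(m) \ln(n)}}{\epsilon} - 1 \right\rceil \tag{18}$$

iterations of Algorithm 2 in order to find a tuple $(\bar{x}, \bar{Y}) \in \Delta_m \times \Delta_n^M$ such that

$$\overline{\phi}(\bar{x}) - \underline{\phi}(\bar{Y}) = \lambda_n \left( \sum_{j=1}^{m} \bar{x}_j A_j \right) - \min_{1 \le j \le m} \langle A_j, \bar{Y} \rangle_F \le \epsilon. \tag{19}$$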
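The worst-case iteration count above is easy to evaluate. The Python sketch below assumes, as in the experiments of Section 3.3, that the target accuracy is relative, that is, $\epsilon = \epsilon_{\mathrm{rel}} \max_j \|A_j\|_{(\infty)}$, so that the norm factor cancels; it also assumes the rounding $\lceil \cdot - 1 \rceil$. Both function names are hypothetical.

```python
import math


def theoretical_iterations(m, n, eps_rel, alpha):
    """Worst-case iterations of the hybrid scheme for a relative accuracy
    eps_rel, i.e. eps = eps_rel * max_j ||A_j|| (the norm factor cancels)."""
    bound = 4.0 * math.sqrt((1.0 + alpha) * math.log(m) * math.log(n)) / eps_rel
    return math.ceil(bound - 1.0)


def smoothness_parameter(eps, n):
    """Smoothness parameter mu = eps / (2 ln(n))."""
    return eps / (2.0 * math.log(n))
```

With `m = 100`, `eps_rel = 0.002`, and `n = 100`, the function returns 9210 for `alpha = 0` and 18420 for `alpha = 3`, which agrees with the theoretical iteration counts reported in Table 1; the factor of two comes from $\sqrt{1 + \alpha} = 2$.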



3.3 Numerical results 

We consider randomly generated instances of Problem (17), where we fix $m$ to 100 and where the
symmetric $(n \times n)$-matrices $A_1, \ldots, A_m$ have a joint sparsity structure, each of them with about
$n^2/10$ non-zero entries. We approximate the parameter

$$C := \max_{1 \le j \le m} \|A_j\|_{(\infty)}$$

by applying the Power method to the matrices $A_j$ and taking the maximum of the computed values
afterwards, which we denote by $\hat{C}$. We solve the randomly generated instances of Problem (17)
up to a relative accuracy of $\epsilon \hat{C}$ with $\epsilon := 0.002$.
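For readers unfamiliar with the Power method, the following Python sketch shows one standard variant for a symmetric matrix stored as a list of rows. It is an illustration only and not the authors' Matlab code; the function names, the deterministic starting vector, and the stopping rule are our assumptions. For a symmetric matrix, the norm of $Av$ for a unit vector $v$ converges to the largest eigenvalue in absolute value, provided that the starting vector is not orthogonal to the dominant eigenspace.

```python
import math


def power_method(A, iterations=1000, tol=1e-12):
    """Approximate max_i |lambda_i(A)| for a symmetric matrix A."""
    n = len(A)
    v = [1.0 / math.sqrt(n)] * n  # a random start is the more common choice
    estimate = 0.0
    for _ in range(iterations):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm == 0.0:
            return 0.0  # v lies in the null space of A
        v = [wi / norm for wi in w]
        if abs(norm - estimate) <= tol * max(1.0, norm):
            return norm
        estimate = norm
    return estimate


def estimate_norm_bound(matrices):
    """Maximum of the Power method estimates over all given matrices."""
    return max(power_method(A) for A in matrices)
```

Each iteration costs one matrix-vector product, so this estimate is cheap compared with a full eigenvalue decomposition, especially for sparse matrices.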

All numerical results that we present in this section are averaged over ten runs and obtained
on a computer with 24 processors, each of them with 2.67 GHz, and with 96 GB of RAM. The
methods are implemented in Matlab (version R2012a). Matrix exponentials are computed through
the Matlab built-in function expm().

Average CPU time [sec]
n                                        100       200       400       800
Original smoothing techniques            139       366     1'406     5'961
Accelerated smoothing techniques         116         3         9        32
Acceleration                          16.55%    99.18%    99.36%    99.46%

Average # of iterations that are required in practice
n                                        100       200       400       800
Original smoothing techniques          6'180     6'690     7'150     7'520
Accelerated smoothing techniques       4'918        18        14        13
Reduction                             20.42%    99.73%    99.80%    99.83%

Average # of iterations that are required in theory
n                                        100       200       400       800
Original smoothing techniques          9'210     9'879    10'505    11'096
Accelerated smoothing techniques      18'420    19'758    21'011    22'193
Reduction                           -100.00%  -100.00%  -100.01%  -100.01%

Table 1: Average CPU time and number of iterations (in practice and in theory) that are required
by original and accelerated smoothing techniques for finding an approximate solution to randomly
generated instances of Problem (17) (with fixed accuracy $0.002\,\hat{C}$ and with $m = 100$).

3.3.1 Comparing the practical behavior of different methods 

In Table 1 we present numerical results for the following two methods:

• Original smoothing techniques: This implementation corresponds to Algorithm 2 with constant
$L_t = L_\mu$ for any $0 \le t \le T$. That is, we set $\alpha = 0$ in Alternative 2 in Section 2.2.

• Accelerated smoothing techniques: We equip Algorithm 2 with the hybrid setting described
in Alternative 2 in Section 2.2, where we choose $\alpha := 3$ and $\kappa$ := 10~^^. With this setting, we
need to perform twice as many iterations as with original smoothing techniques with respect
to the worst-case bounds; see (18).

For both methods, we check the duality gap (19) at every 100-th iteration. Additionally, for the
latter method, we also verify this condition at each of the first hundred iterations. The maximal
eigenvalue that corresponds to the first term in (19) is computed through the Matlab built-in
functions max() and eig().

We observe that accelerated smoothing techniques require significantly less CPU time and fewer
iterations in practice than original smoothing techniques; see Table 1. For problems involving matrices
of size 200 x 200 up to size 800 x 800, we can reduce the number of iterations in practice and the CPU
time by more than 99%. Interestingly, the number of iterations that are required by accelerated
smoothing techniques in practice even decreases as the matrix size $n$ grows.

Note that there exists a gap in the average CPU time and number of iterations that are required
by accelerated smoothing techniques in practice for solving the instances of size 100 x 100 and the
instances of size 200 x 200. In Figure 1 we plot the values

$$\beta_t := \frac{-\chi_t}{\ln(m) L_0} = \frac{\sum_{t'=0}^{t-1} (L_{t'} - L_{t'+1}) \left( d_{\Delta_m}(z_{t'+1}) - \frac{1}{2} \|z_{t'} - \hat{x}_{t'+1}\|_1^2 \right)}{D_1 L_0} \quad \forall\, t \ge 1. \tag{20}$$

In contrast to the cases $n = 200$, $n = 400$, or $n = 800$, where these values remain small (that is,
below 0.25), we have considerably large values $\beta_t$ for $n = 100$. However, the values are still below
3, as we switch back to a non-adaptive setting as soon as $\beta_t$ would be larger than 3. This behavior
is in full accordance with the gap mentioned in the beginning of this paragraph. The non-smooth
patterns at the end of the plots in Figure 1 are due to the averaging over the different runs (we
may need a different number of iterations in the different runs).



[Figure 1 consists of four panels, each plotting the ratio $\beta_t$ (vertical axis) against the number of iterations (horizontal axis).]

Figure 1: Ratios $\beta_t$; see (20) for the definition of these ratios.



3.3.2 Solving problems of very large scale 

In Table 2, we show numerical results for accelerated smoothing techniques (with a = 3, k = 10~^^, and the same duality gap checking procedure as above) when applied to randomly generated instances of (17) that are of very large scale. Using accelerated smoothing techniques, we are able to approximately solve instances of (17) involving matrices of size 12'800 × 12'800 in about 8 hours and 40 minutes on average. Clearly, this performance would be out of reach for the original smoothing techniques.

Accelerated smoothing techniques applied to large-scale instances of (17)

n                                                      1'600    3'200    6'400   12'800
CPU time [sec]                                           158      791    4'566   31'240
Average # of iterations required in practice              13       13       13       13
Average # of iterations required in theory            23'315   24'386   25'411   26'397

Table 2: Average CPU time and number of iterations (in practice and in theory) that are required by accelerated smoothing techniques for finding an approximate solution to randomly generated large-scale instances of Problem (17) (with fixed accuracy 0.002>C' and with m = 100).
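To make the growth of the running time in Table 2 more tangible, the following Python sketch estimates a local scaling exponent $p$ (assuming, purely as a heuristic model, that the CPU time behaves like $n^p$ between two consecutive sizes) and converts the largest CPU time into hours and minutes. The function name `scaling_exponents` and the power-law model are illustrative assumptions; only the numbers are taken from the table.

```python
import math

# Values copied from Table 2: matrix size n and average CPU time in seconds.
sizes = [1600, 3200, 6400, 12800]
cpu_seconds = [158, 791, 4566, 31240]


def scaling_exponents(ns, times):
    """Estimate p for each consecutive pair, assuming time ~ n**p locally."""
    pairs = list(zip(ns, times))
    return [
        math.log(t2 / t1) / math.log(n2 / n1)
        for (n1, t1), (n2, t2) in zip(pairs, pairs[1:])
    ]


exponents = scaling_exponents(sizes, cpu_seconds)

# Convert the CPU time of the largest instance into whole hours and minutes.
hours, remainder = divmod(cpu_seconds[-1], 3600)
minutes = remainder // 60
```

Under this model, the estimated exponents lie roughly between 2.3 and 2.8, and the largest instance corresponds to 8 hours and 40 minutes. Since the average number of iterations stays at 13 for all sizes, the growth appears to come from the cost per iteration rather than from the iteration count.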
A Proof of Theorem 2.1

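Choose $T \in \mathbb{N}_0$ and let the sequences $(x_t)_{t \ge 0}$, $(u_t)_{t \ge 0}$, $(z_t)_{t \ge 0}$, $(\hat{x}_t)_{t \ge 1}$, $(\gamma_t)_{t \ge 0}$, $(\Gamma_t)_{t \ge 0}$, $(\tau_t)_{t \ge 0}$, and $(L_t)_{t \ge 0}$ be generated by Algorithm 1. Recall that Inequality $(I_t)$ holds for $0 \le t \le T$ if

\[
\Gamma_t f(u_t) + \sum_{k=0}^{t-1} (L_{k+1} - L_k) \left( d_Q(z_{k+1}) - \frac{1}{2} \|z_k - \hat{x}_{k+1}\|_{\mathbb{R}^n}^2 \right) \le \psi_t, \tag{$I_t$}
\]

where

\[
\psi_t := \min_{x \in Q} \left\{ \sum_{k=0}^{t} \gamma_k \left( f(x_k) + \langle \nabla f(x_k), x - x_k \rangle \right) + L_t d_Q(x) \right\}.
\]

By its definition (see Algorithm 1), the element $z_t \in Q$ is the minimizer of the above optimization problem, which allows us to rewrite $\psi_t$ as:

\[
\psi_t = \sum_{k=0}^{t} \gamma_k \left( f(x_k) + \langle \nabla f(x_k), z_t - x_k \rangle \right) + L_t d_Q(z_t).
\]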

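We show by induction that Inequality $(I_t)$ holds for any $0 \le t \le T$.

Lemma A.1 Inequality $(I_0)$ holds, that is, we have $\gamma_0 f(u_0) \le \psi_0$.

Proof: We apply the definition of $u_0$ (see Algorithm 1), the strong convexity of $d_Q$, the condition on $\gamma_0$ saying that $\gamma_0 \in (0, 1]$, and Theorem 2.1.5 in [Nes03] in order to justify the following relations:

\[
\begin{aligned}
\psi_0 &= \min_{x \in Q} \left\{ \gamma_0 \left( f(x_0) + \langle \nabla f(x_0), x - x_0 \rangle \right) + L_0 d_Q(x) \right\} \\
&= \gamma_0 \left( f(x_0) + \langle \nabla f(x_0), u_0 - x_0 \rangle \right) + L_0 d_Q(u_0) \\
&\ge \gamma_0 \left( f(x_0) + \langle \nabla f(x_0), u_0 - x_0 \rangle \right) + \frac{L_0}{2} \|u_0 - x_0\|_{\mathbb{R}^n}^2 \\
&\ge \gamma_0 \left( f(x_0) + \langle \nabla f(x_0), u_0 - x_0 \rangle + \frac{L_0}{2} \|u_0 - x_0\|_{\mathbb{R}^n}^2 \right) \\
&\ge \gamma_0 f(u_0).
\end{aligned}
\]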

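Let us verify the inductive step.

Lemma A.2 Let $0 \le t \le T - 1$. If Inequality $(I_t)$ holds, then $(I_{t+1})$ is also true.

Proof: Let $0 \le t \le T - 1$ and assume that $(I_t)$ holds. We make the following two definitions:

\[
\chi_t := \sum_{k=0}^{t-1} (L_{k+1} - L_k) \left( d_Q(z_{k+1}) - \frac{1}{2} \|z_k - \hat{x}_{k+1}\|_{\mathbb{R}^n}^2 \right) \in \mathbb{R},
\qquad
s_t := \sum_{k=0}^{t} \gamma_k \nabla f(x_k) \in \mathbb{R}^n.
\]

In addition, we define the linear function:

\[
\ell_t : Q \to \mathbb{R} : x \mapsto \ell_t(x) = \sum_{k=0}^{t} \gamma_k \left( f(x_k) + \langle \nabla f(x_k), x - x_k \rangle \right).
\]

Choose $x \in Q$. The definition of $z_t$ implies:

\[
0 \le \left\langle L_t \nabla d_Q(z_t) + \sum_{k=0}^{t} \gamma_k \nabla f(x_k), \, x - z_t \right\rangle = \langle L_t \nabla d_Q(z_t) + s_t, \, x - z_t \rangle. \tag{21}
\]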

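As Inequality $(I_t)$ holds and as the function $f$ is convex, we have:

\[
\psi_t \ge \Gamma_t f(u_t) + \chi_t \ge \Gamma_t \left( f(x_{t+1}) + \langle \nabla f(x_{t+1}), u_t - x_{t+1} \rangle \right) + \chi_t.
\]

This implies:

\[
\psi_t + \gamma_{t+1} \left( f(x_{t+1}) + \langle \nabla f(x_{t+1}), x - x_{t+1} \rangle \right) \ge \Gamma_{t+1} f(x_{t+1}) + \gamma_{t+1} \langle \nabla f(x_{t+1}), x - z_t \rangle + \chi_t,
\]

where we use the relations $\Gamma_{t+1} = \Gamma_t + \gamma_{t+1}$ and

\[
\begin{aligned}
\Gamma_t (u_t - x_{t+1}) + \gamma_{t+1} (x - x_{t+1}) &= \Gamma_t u_t - \Gamma_{t+1} x_{t+1} + \gamma_{t+1} x \\
&= \Gamma_t u_t - \Gamma_{t+1} \left( \tau_t z_t + (1 - \tau_t) u_t \right) + \gamma_{t+1} x \\
&= \Gamma_t u_t - \Gamma_{t+1} \left( \frac{\gamma_{t+1}}{\Gamma_{t+1}} z_t + \frac{\Gamma_t}{\Gamma_{t+1}} u_t \right) + \gamma_{t+1} x \\
&= \gamma_{t+1} (x - z_t).
\end{aligned}
\]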
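The rearrangement above only uses $x_{t+1} = \tau_t z_t + (1 - \tau_t) u_t$, $\Gamma_{t+1} = \Gamma_t + \gamma_{t+1}$, and $\tau_t = \gamma_{t+1} / \Gamma_{t+1}$ (the last relation is what the third line of the chain suggests). As a sanity check, the following Python sketch evaluates both sides of the identity on random data. The function name `identity_gap`, the dimension, and the sampling ranges are illustrative assumptions and not part of the method.

```python
import random


def identity_gap(dim=5, seed=0):
    """Largest componentwise gap between both sides of the identity
    Gamma_t (u_t - x_{t+1}) + gamma_{t+1} (x - x_{t+1}) = gamma_{t+1} (x - z_t).
    """
    rng = random.Random(seed)
    Gamma_t = rng.uniform(0.5, 10.0)      # arbitrary positive value
    gamma_next = rng.uniform(0.1, 5.0)    # arbitrary positive value
    Gamma_next = Gamma_t + gamma_next     # Gamma_{t+1} = Gamma_t + gamma_{t+1}
    tau = gamma_next / Gamma_next         # assumed: tau_t = gamma_{t+1} / Gamma_{t+1}

    u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # x_{t+1} = tau_t z_t + (1 - tau_t) u_t
    x_next = [tau * zi + (1.0 - tau) * ui for zi, ui in zip(z, u)]

    lhs = [Gamma_t * (ui - xn) + gamma_next * (xi - xn)
           for ui, xn, xi in zip(u, x_next, x)]
    rhs = [gamma_next * (xi - zi) for xi, zi in zip(x, z)]
    return max(abs(a - b) for a, b in zip(lhs, rhs))


gap = identity_gap()
```

Up to floating-point rounding, `gap` should vanish for any seed, since the terms in $u_t$ cancel exactly.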

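Combining the above inequality with the fact that $\psi_t = L_t d_Q(z_t) + \ell_t(z_t)$ and with (21), we observe:

\[
\begin{aligned}
L_t d_Q(x) + \ell_{t+1}(x)
&= L_t d_Q(x) + \ell_t(x) + \gamma_{t+1} \left( f(x_{t+1}) + \langle \nabla f(x_{t+1}), x - x_{t+1} \rangle \right) \\
&= L_t V_{z_t}(x) + \psi_t + \langle L_t \nabla d_Q(z_t) + s_t, x - z_t \rangle + \gamma_{t+1} \left( \langle \nabla f(x_{t+1}), x - x_{t+1} \rangle + f(x_{t+1}) \right) \\
&\ge L_t V_{z_t}(x) + \psi_t + \gamma_{t+1} \left( f(x_{t+1}) + \langle \nabla f(x_{t+1}), x - x_{t+1} \rangle \right) \\
&\ge L_t V_{z_t}(x) + \Gamma_{t+1} f(x_{t+1}) + \gamma_{t+1} \langle \nabla f(x_{t+1}), x - z_t \rangle + \chi_t.
\end{aligned}
\]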

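With $\vartheta_t^{(1)} := (L_{t+1} - L_t) d_Q(z_{t+1})$, we thus get:

\[
\begin{aligned}
\psi_{t+1} &= \min_{x \in Q} \left\{ L_{t+1} d_Q(x) + \ell_{t+1}(x) \right\} \\
&= L_{t+1} d_Q(z_{t+1}) + \ell_{t+1}(z_{t+1}) \\
&= \vartheta_t^{(1)} + L_t d_Q(z_{t+1}) + \ell_{t+1}(z_{t+1}) \\
&\ge \vartheta_t^{(1)} + \min_{x \in Q} \left\{ L_t d_Q(x) + \ell_{t+1}(x) \right\} \\
&\ge \vartheta_t^{(1)} + \min_{x \in Q} \left\{ L_t V_{z_t}(x) + \Gamma_{t+1} f(x_{t+1}) + \gamma_{t+1} \langle \nabla f(x_{t+1}), x - z_t \rangle + \chi_t \right\}.
\end{aligned}
\]

Let $\vartheta_t^{(2)} := \frac{1}{2} (L_t - L_{t+1}) \|z_t - \hat{x}_{t+1}\|_{\mathbb{R}^n}^2$. Using the construction rule for $\hat{x}_{t+1}$ and the fact that the inequality $V_z(x) \ge \|x - z\|_{\mathbb{R}^n}^2 / 2$ holds for any $x \in Q$ and $z \in Q^{\circ}$ (this relation follows from the strong convexity of $d_Q$), we obtain:

\[
\begin{aligned}
\psi_{t+1} &\ge \vartheta_t^{(1)} + L_t V_{z_t}(\hat{x}_{t+1}) + \Gamma_{t+1} f(x_{t+1}) + \gamma_{t+1} \langle \nabla f(x_{t+1}), \hat{x}_{t+1} - z_t \rangle + \chi_t \\
&\ge \vartheta_t^{(1)} + \frac{L_t}{2} \|z_t - \hat{x}_{t+1}\|_{\mathbb{R}^n}^2 + \Gamma_{t+1} f(x_{t+1}) + \gamma_{t+1} \langle \nabla f(x_{t+1}), \hat{x}_{t+1} - z_t \rangle + \chi_t \\
&= \vartheta_t^{(1)} + \vartheta_t^{(2)} + \frac{L_{t+1}}{2} \|z_t - \hat{x}_{t+1}\|_{\mathbb{R}^n}^2 + \Gamma_{t+1} f(x_{t+1}) + \gamma_{t+1} \langle \nabla f(x_{t+1}), \hat{x}_{t+1} - z_t \rangle + \chi_t \\
&= \vartheta_t^{(1)} + \vartheta_t^{(2)} + \chi_t + \Gamma_{t+1} \left( \frac{L_{t+1}}{2 \Gamma_{t+1}} \|z_t - \hat{x}_{t+1}\|_{\mathbb{R}^n}^2 + f(x_{t+1}) + \tau_t \langle \nabla f(x_{t+1}), \hat{x}_{t+1} - z_t \rangle \right).
\end{aligned}
\]

As $\tau_t^2 \le \Gamma_{t+1}^{-1}$ and as $x_{t+1} - \tau_t z_t = (1 - \tau_t) u_t = u_{t+1} - \tau_t \hat{x}_{t+1}$, this inequality yields:

\[
\begin{aligned}
\psi_{t+1} &\ge \vartheta_t^{(1)} + \vartheta_t^{(2)} + \chi_t + \Gamma_{t+1} \left( \frac{L_{t+1} \tau_t^2}{2} \|z_t - \hat{x}_{t+1}\|_{\mathbb{R}^n}^2 + f(x_{t+1}) + \tau_t \langle \nabla f(x_{t+1}), \hat{x}_{t+1} - z_t \rangle \right) \\
&= \vartheta_t^{(1)} + \vartheta_t^{(2)} + \chi_t + \Gamma_{t+1} \left( \frac{L_{t+1}}{2} \|u_{t+1} - x_{t+1}\|_{\mathbb{R}^n}^2 + f(x_{t+1}) + \langle \nabla f(x_{t+1}), u_{t+1} - x_{t+1} \rangle \right).
\end{aligned}
\]

It remains to apply ([S]):

\[
\begin{aligned}
\psi_{t+1} &\ge \vartheta_t^{(1)} + \vartheta_t^{(2)} + \Gamma_{t+1} f(u_{t+1}) + \chi_t \\
&= \sum_{k=0}^{t} (L_{k+1} - L_k) \left( d_Q(z_{k+1}) - \frac{1}{2} \|z_k - \hat{x}_{k+1}\|_{\mathbb{R}^n}^2 \right) + \Gamma_{t+1} f(u_{t+1}).
\end{aligned}
\]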
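The proof of Lemma A.2 relies on the lower bound $V_z(x) \ge \|x - z\|^2 / 2$. The following Python sketch checks this bound numerically under assumptions made purely for illustration: $Q$ is the standard simplex, $d_Q$ is the entropy function $\sum_i x_i \ln x_i$, and the norm is the 1-norm. In that setting $V_z(x)$ is the Kullback-Leibler divergence and the bound is Pinsker's inequality. The helper names are hypothetical.

```python
import math
import random


def bregman_entropy(x, z):
    """Bregman distance V_z(x) induced by d(x) = sum_i x_i * ln(x_i)."""
    return sum(xi * math.log(xi / zi) - xi + zi for xi, zi in zip(x, z))


def random_simplex_point(m, rng):
    """Strictly positive point of the standard simplex in R^m."""
    weights = [rng.uniform(0.01, 1.0) for _ in range(m)]
    total = sum(weights)
    return [w / total for w in weights]


def smallest_slack(m=6, trials=1000, seed=0):
    """Minimum of V_z(x) - 0.5 * ||x - z||_1^2 over random pairs (x, z)."""
    rng = random.Random(seed)
    slacks = []
    for _ in range(trials):
        x = random_simplex_point(m, rng)
        z = random_simplex_point(m, rng)
        l1_distance = sum(abs(xi - zi) for xi, zi in zip(x, z))
        slacks.append(bregman_entropy(x, z) - 0.5 * l1_distance ** 2)
    return min(slacks)


slack = smallest_slack()
```

A nonnegative `slack` (up to rounding) is consistent with the bound; a random search of this kind illustrates the inequality but does not prove it.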